Abstract

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.


1 Introduction 


In recent years, there have been many successful attempts to build large “knowledge bases” (KBs), such as NELL (Carlson et al., 2010), KnowItAll (Etzioni et al., 2011), YAGO (Suchanek et al., 2007), and Google’s Knowledge Graph/Vault (Dong et al., 2014). These KBs mostly focus on declarative facts, such as “Barack Obama was born in Hawaii”. But human knowledge also encompasses procedural information not yet within the scope of such declarative KBs - instructions and demonstrations of how to dance the tango, for example, or how to change a tire on your car. A KB for organizing and retrieving such procedural knowledge could be a valuable resource for helping people (and potentially even robots - e.g., (Saxena et al., 2014; Yang et al., 2015)) learn to perform various tasks.


In contrast to declarative information, procedural knowledge tends to be inherently multimodal. In particular, both language and perceptual information are typically used to parsimoniously describe procedures, as evidenced by the large number of “how-to” videos and illustrated guides on the open web. To automatically construct a multimodal database of procedural knowledge, we thus need tools for extracting information from both textual and visual sources. Crucially, we also need to figure out how these various kinds of information, which often complement and overlap each other, fit together to form a structured knowledge base of procedures.

As a small step toward the broader goal of aligning language and perception, we focus in this paper on the problem of aligning video depictions of procedures to steps in an accompanying text that corresponds to the procedure. We focus on the cooking domain due to the prevalence of cooking videos on the web and the relative ease of interpreting their recipes as linear sequences of canonical actions. In this domain, the textual source is a user-uploaded recipe attached to the video showing the recipe’s execution. The individual steps of procedures are cooking actions like “peel an onion”, “slice an onion”, etc. However, our techniques can be applied to any domain that has textual instructions and corresponding videos, including videos at sites such as youtube.com, howcast.com, howdini.com or videojug.com.

The approach we take in this paper leverages the fact that the speech signal in instructional videos is often closely related to the actions that the person is performing (which is not true in more general videos). Thus we first align the instructional steps to the speech signal using an HMM, and then refine this alignment by using a state-of-the-art computer vision system.

In summary, our contributions are as follows. First, we propose a novel system that combines text, speech and vision to perform an alignment between textual instructions and instructional videos. Second, we use our system to create a large corpus of 180k aligned recipe-video pairs, and an even larger corpus of 1.4M short video clips, each labeled with a cooking action and a noun phrase. We evaluate the quality of our corpus using human raters. Third, we show how we can use our methods to support applications such as within-video search and recipe auto-illustration.

2 Data and pre-processing 

We first describe how we collected our corpus of recipes and videos, and the pre-processing steps that we run before applying our alignment model. The corpus of recipes, as well as the results of the alignment model, will be made available for download at github.com/malmaud/whats_cookin.


2.1 Collecting a large corpus of cooking videos with recipes

We first searched Youtube for videos which have been automatically tagged with the Freebase mids /m/01mtb (Cooking) and /m/0p57p (recipe), and which have (automatically produced) English-language speech transcripts, which yielded a collection of 7.4M videos. Of these videos, we kept the videos that also had accompanying descriptive text, leaving 6.2M videos.

Sometimes the recipe for a video is included in this text description, but sometimes it is stored on an external site. For example, a video’s text description might say “Click here for the recipe”. To find the recipe in such cases, we look for sentences in the video description with any of the following keywords: “recipe”, “steps”, “cook”, “procedure”, “preparation”, “method”. If we find any such tokens, we find any URLs that are mentioned in the same sentence, and extract the corresponding document, giving us an additional 206k documents. We then combine the original descriptive text with any additional text that we retrieve in this way.

Class          Precision   Recall   F1
Background     0.97        0.95     0.96
Ingredient     0.93        0.95     0.94
Recipe step    0.94        0.95     0.94

Table 1: Test set performance of text-based recipe classifier.

Finally, in order to extract the recipe from the text 
description of a video, we trained a classifier that 
classifies each sentence into 1 of 3 classes: recipe 
step, recipe ingredient, or background. We keep 
only the videos which have at least one ingredient 
sentence and at least one recipe sentence. This last 
step leaves us with 180,000 videos. 

To train the recipe classifier, we need labeled examples, which we obtain by exploiting the fact that many text webpages containing recipes use the machine-readable markup defined at http://schema.org/Recipe. From this we extract 500k examples of recipe sentences, and 500k examples of ingredient sentences. We also sample 500k sentences at random from webpages to represent the non-recipe class. Finally, we train a 3-class naive Bayes model on this data using simple bag-of-words feature vectors. The performance of this model on a separate test set is shown in Table 1.
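As a rough illustration of the classifier described above, the following is a minimal sketch of a multinomial naive Bayes model over bag-of-words features. The function names, the add-one smoothing, the whitespace tokenizer, and the toy training sentences are all assumptions made for this example; they are not taken from the system described in the text.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_sentences, smoothing=1.0):
    """Fit a multinomial naive Bayes model on (sentence, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter()            # label -> number of sentences
    vocab = set()
    for sentence, label in labeled_sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "class_counts": class_counts,
            "vocab": vocab, "smoothing": smoothing}


def classify(model, sentence):
    """Return the label with the highest posterior log-probability."""
    total = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    a = model["smoothing"]
    best_label, best_score = None, -math.inf
    for label, n in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + a * vocab_size
        score = math.log(n / total)  # class prior
        for tok in sentence.lower().split():
            if tok in model["vocab"]:  # words never seen in training are ignored
                score += math.log((counts[tok] + a) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data standing in for the schema.org-derived training sentences.
TOY_DATA = [
    ("2 cups flour", "ingredient"),
    ("1 cup sugar", "ingredient"),
    ("mix the flour and sugar", "step"),
    ("bake for 20 minutes", "step"),
    ("thanks for watching my channel", "background"),
    ("subscribe for more videos", "background"),
]
MODEL = train_naive_bayes(TOY_DATA)
```

With a model like this, a video would be kept only if at least one of its sentences is classified as an ingredient and at least one as a recipe step.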


2.2 Parsing the recipe text 


For each recipe, we apply a suite of in-house NLP tools, similar to the Stanford Core NLP pipeline. In particular, we perform POS tagging, entity chunking, and constituency parsing (based on a re-implementation of (Petrov et al., 2006)).^1 Following (Druck and Pang, 2012), we use the parse tree structure to partition each sentence into “micro steps”. In particular, we split at any token categorized by the parser as a conjunction only if that token’s parent in the sentence’s constituency parse is a verb phrase. Any recipe step that is missing a verb is considered noise and discarded.

We then label each recipe step with an optional action and a list of 0 or more noun chunks. The action label is the lemmatized version of the head verb of the recipe step. We look at all chunked noun entities in the step which are the direct object of the action (either directly or via the preposition “of”, as in “Add a cup of flour”).

^1 Sometimes the parser performs poorly, because the language used in recipes is often full of imperative sentences, such as “Mix the flour”, whereas the parser is trained on newswire text. As a simple heuristic for overcoming this, we classify any token at the beginning of a sentence as a verb if it lexically matches a manually-defined list of cooking-related verbs.

We canonicalize these entities by computing their similarity to the list of ingredients associated with this recipe. If an ingredient is sufficiently similar, that ingredient is added to this step’s entity list. Otherwise, the stemmed entity is used. For example, consider the step “Mix tomato sauce and pasta”; if the recipe has a known ingredient called “spaghetti”, we would label the action as “mix” and the entities as “tomato sauce” and “spaghetti”, because of its high semantic similarity to “pasta”. (Semantic similarity is estimated based on Euclidean distance between word embedding vectors computed using the method of (Mikolov et al., 2013) trained on general web text.)

In many cases, the direct object of a transitive verb is elided (not explicitly stated); this is known as the “zero anaphora” problem. For example, the text may say “Add eggs and flour to the bowl. Mix well.”. The object of the verb “mix” is clearly the stuff that was just added to the bowl (namely the eggs and flour), although this is not explicitly stated. To handle this, we use a simple recency heuristic, and insert the entities from the previous step into the current step.
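The recency heuristic can be sketched in a few lines. The representation of a step as an (action, entities) pair and the function name are assumptions for illustration; the sketch also assumes that entities filled in for one step may be carried forward again if the following step also lacks an object, which the text does not specify.

```python
def fill_elided_objects(steps):
    """Copy the previous step's entities into any step that has none.

    `steps` is a list of (action, entities) pairs in recipe order, where
    `entities` is a (possibly empty) list of noun-chunk strings.
    """
    resolved = []
    previous_entities = []
    for action, entities in steps:
        if not entities:
            # Zero anaphora: reuse whatever the previous step acted on.
            entities = list(previous_entities)
        resolved.append((action, entities))
        previous_entities = entities
    return resolved
```

For the example in the text, the steps ("add", ["eggs", "flour"]) and ("mix", []) would resolve to a second step whose entities are the eggs and flour.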



2.3 Processing the speech transcript

The output of Youtube’s ASR system is a sequence of time-stamped tokens, produced by a standard Viterbi decoding system. We concatenate these tokens into a single long document, and then apply our NLP pipeline to it. Note that, in addition to errors introduced by the ASR system,^2 the NLP system can introduce additional errors, because it does not work well on text that may be ungrammatical and which is entirely devoid of punctuation and sentence boundary markers.

To assess the impact of these combined sources of error, we also collected a much smaller set of 480 cooking videos (with corresponding recipe text) for which the video creator had uploaded a manually curated speech transcript; this has no transcription errors, it contains sentence boundary markers, and it also aligns whole phrases with the video (instead of just single tokens). We applied the same NLP pipeline to these manual transcripts. In the results section, we will see that the accuracy of our end-to-end system is indeed higher when the speech transcript is error-free and well-formed. However, we can still get good results using noisier, automatically produced transcripts.

^2 According to (Liao et al., 2013), the Youtube ASR system we used, based on using Gaussian mixture models for the acoustic model, has a word error rate of about 52% (averaged over all English-language videos; some genres, such as news, had lower error rates). The newer system, which uses deep neural nets for the acoustic model, has an average WER of 44%; however, this was not available to us at the time we did our experiments.

3 Methods

In this section, we describe our system for aligning instructional text and video.

Figure 1: Graphical model representation of the factored HMM. See text for details.



3.1 HMM to align recipe with ASR transcript 

We align each step of the recipe to a corresponding sequence of words in the ASR transcript by using the input-output HMM shown in Figure 1. Here $X(1:K)$ represents the textual recipe steps (obtained using the process described in Section 2.2); $Y(1:T)$ represents the ASR tokens (spoken words); $R(t) \in \{1, \ldots, K\}$ is the recipe step number for frame $t$; and $B(t) \in \{0, 1\}$ represents whether timestep $t$ is generated by the background ($B = 1$) or foreground model ($B = 0$). This background variable is needed since sometimes sequences of spoken words are unrelated to the content of the recipe, especially at the beginning and end of a video.

The conditional probability distributions (CPDs) for the Markov chain are as follows:

This encodes our assumption that the video follows the same ordering as the recipe and that background/foreground tokens tend to cluster together. Obviously these assumptions do not always hold, but they are a reasonable approximation.

For each recipe, we set $\alpha = K/T$, the ratio of recipe steps to transcript tokens. This setting corresponds to an a priori belief that each recipe step is aligned with the same number of transcript tokens. The parameter $\gamma$ in our experiments is set by cross-validation to 0.7 based on a small set of manually-labeled recipes.
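The surrounding description (monotone progress through the recipe, sticky background/foreground state, and the parameters $\alpha$ and $\gamma$) is consistent with transition distributions of the following form. This is a reconstruction from the prose under those assumptions, not necessarily the exact form used in the experiments:

```latex
p(R(t) = r \mid R(t-1) = r') =
\begin{cases}
  \alpha     & \text{if } r = r' + 1, \\
  1 - \alpha & \text{if } r = r', \\
  0          & \text{otherwise,}
\end{cases}
\qquad
p(B(t) = b \mid B(t-1) = b') =
\begin{cases}
  \gamma     & \text{if } b = b', \\
  1 - \gamma & \text{if } b \neq b'.
\end{cases}
```

Under this form, the number of tokens spent in one recipe step is geometric with mean $1/\alpha = T/K$, which matches the stated prior belief that every step covers the same number of transcript tokens.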

For the foreground observation model, we generate the observed word from the corresponding recipe step via:

$$\log p(Y(t) = y \mid R(t) = k, X(1:K), B(t) = 0) \propto \max(\{\mathrm{WordSimilarity}(y, x) : x \in X(k)\}),$$

where $X(k)$ is the set of words in the $k$'th recipe step, and $\mathrm{WordSimilarity}(s, t)$ is a measure of similarity between words $s$ and $t$, based on word vector distance.

If this frame is aligned to the background, we generate it from the empirical distribution of words, which is estimated based on pooling all the data:

$$p(Y(t) = y \mid R(t) = k, B(t) = 1) = p(y).$$

Finally, the prior for $p(B(1))$ is uniform, and $p(R(1))$ is set to a delta function on $R(1) = 1$ (i.e., we assume videos start at step 1 of the recipe).

Having defined the model, we “flatten” it to a standard HMM (by taking the cross product of $R_t$ and $B_t$), then estimate the MAP sequence using the Viterbi algorithm. See Figure 2 for an example.
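To make the flattening step concrete, here is a minimal Viterbi sketch over the product state space of (recipe step, background flag). It is an illustration under stated assumptions rather than the actual implementation: the function and parameter names are hypothetical, steps are indexed from 0, the step-advance probability is set to K/T (capped just below 1 so the logarithm is defined), the background flag persists with probability gamma, and `similarity` and `background_logp` are caller-supplied stand-ins for the word-vector similarity and the pooled unigram log-probability.

```python
import math


def viterbi_align(steps, tokens, similarity, background_logp, gamma=0.7):
    """Return the MAP list of (step_index, is_background), one entry per token.

    steps: list of non-empty sets of words, one set per recipe step.
    similarity(token, word): log-scale foreground score (higher is better).
    background_logp(token): log-probability of the token under the background model.
    """
    K, T = len(steps), len(tokens)
    alpha = min(K / T, 0.99)  # K/T as in the text; capped so log(1 - alpha) exists
    states = [(k, b) for k in range(K) for b in (0, 1)]

    def log_trans(prev, cur):
        (pk, pb), (k, b) = prev, cur
        if k == pk:
            step_term = math.log(1 - alpha)   # stay in the same recipe step
        elif k == pk + 1:
            step_term = math.log(alpha)       # advance to the next step
        else:
            return -math.inf                  # no skipping or going backwards
        return step_term + math.log(gamma if b == pb else 1 - gamma)

    def log_emit(state, token):
        k, b = state
        if b == 1:
            return background_logp(token)
        return max(similarity(token, w) for w in steps[k])

    # Initial distribution: first recipe step, background flag uniform.
    score = {s: (math.log(0.5) + log_emit(s, tokens[0])) if s[0] == 0 else -math.inf
             for s in states}
    backpointers = []
    for t in range(1, T):
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + log_trans(p, s))
            new_score[s] = score[prev] + log_trans(prev, s) + log_emit(s, tokens[t])
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Each maximal run of tokens assigned to the same foreground step then defines one candidate video segment, using the tokens' timestamps as the segment boundaries.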

Finally, we label each segment of the video as follows: use the segmentation induced by the alignment, and extract the action and object from the corresponding recipe step as described in Section 2.2. If the segment was labeled as background by the HMM, we do not apply any label to it.

3.2 Keyword spotting 

A simpler approach to labeling video segments is to just search for verbs in the ASR transcript, and then to extract a fixed-sized window around the timestamp where the keyword occurred. We call this approach “keyword spotting”. A similar method from (Yu et al., 2014) filters ASR transcripts by part-of-speech tag and finds tokens that match a small vocabulary to create a corpus of video clips (extracted from instructional videos), each labeled with an action/object pair.

In more detail, we manually define a whitelist of ~200 actions (all transitive verbs) of interest, such as “add”, “chop”, “fry”, etc. We then identify when these words are spoken (relying on the POS tags to filter out non-verbs), and extract an 8 second video clip around this timestamp (using 2 seconds prior to the action being mentioned, and 6 seconds following). To extract the object, we take all tokens tagged as “noun” within 5 tokens after the action.
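The keyword-spotting baseline can be sketched as follows. The three-verb whitelist, the tag names "VERB" and "NOUN", and the (word, tag, timestamp) token format are assumptions for this example; only the 2-second/6-second window and the 5-token object window come from the text.

```python
# Hypothetical subset of the manually defined action whitelist.
ACTION_WHITELIST = {"add", "chop", "fry"}


def spot_keywords(tagged_tokens, before=2.0, after=6.0, object_window=5):
    """Extract labeled clips from a POS-tagged, time-stamped transcript.

    tagged_tokens: list of (word, pos_tag, timestamp_seconds) in spoken order.
    Returns one dict per spotted action with its objects and clip boundaries.
    """
    clips = []
    for i, (word, pos, time) in enumerate(tagged_tokens):
        if pos != "VERB" or word not in ACTION_WHITELIST:
            continue
        # Objects: every noun among the next `object_window` tokens.
        following = tagged_tokens[i + 1 : i + 1 + object_window]
        objects = [w for w, p, _ in following if p == "NOUN"]
        clips.append({
            "action": word,
            "objects": objects,
            "start": max(0.0, time - before),  # clamp at the start of the video
            "end": time + after,
        })
    return clips
```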


3.3 Hybrid HMM + keyword spotting 

We cannot use keyword spotting if the goal is to align instructional text to videos. However, if our goal is just to create a labeled corpus of video clips, keyword spotting is a reasonable approach. Unfortunately, we noticed that the quality of the labels (especially the object labels) generated by keyword spotting was not very high, due to errors in the ASR. On the other hand, we also noticed that the recall of the HMM approach was about 5 times lower than using keyword spotting, and furthermore, that the temporal localization accuracy was sometimes worse.

To get the best of both worlds, we employ the following hybrid technique. We perform keyword spotting for the action in the ASR transcript as before, but use the HMM alignment to infer the corresponding object. To avoid false positives, we only use the output of the HMM for this video if at least half of the recipe steps are aligned by it to the speech transcript; otherwise we back off to the baseline approach of extracting the noun phrase from the ASR transcript in the window after the verb.

3.4 Temporal refinement using vision 

In our experiments, we noticed that sometimes the narrator describes an action before actually performing it (this was also noted in (Yu et al., 2014)). To partially combat this problem, we used computer vision to refine candidate video segments as follows. We first trained visual detectors for a large collection of food items (described below). Then, given a candidate video segment annotated with an action/object pair (coming from any of the previous three methods), we find a translation of the window (of up to 3 seconds in either direction) for which the average detector score corresponding to the object is maximized. The intuition is that by detecting when the object in question is visually present in the scene, it is more likely that the corresponding action is actually being performed.

Recipe Steps

1: In a bowl combine flour, chilli powder, cumin, paprika and five spice. Once thoroughly mixed, add in chicken strips and coat in mixture.
2: Heat oil in a wok or large pan on medium to high heat. Add in chicken and cook until lightly brown for 3 -- 5 minutes.
3: Add in chopped vegetables along with garlic, lime juice, hot sauce and Worcestershire sauce.
4: Cook for a further 15 minutes on medium heat.
5: As the mixture cooks, chop the tomatoes and add lettuce, and cucumber into a serving bowl.
6: Once cooked, serve fajita mix with whole wheat wrap. Add a spoonful of fajita mix into wrap with salsa and natural yogurt. Wrap or roll up the tortilla and serve with side salad.

Automatic Speech Transcription

in a bowl combine the flower chili powder paprika cumen and five-spice do 130 mixed add in the chicken strips and post in the flour mixture he's oil in a walk for large pan on medium to high heat add in the chicken and cook until lightly browned for three to five minutes add in chopped vegetables along with the garlic lime juice hot sauce and Worcestershire sauce dome cook for a further 15 minutes on medium peace and the mixture coax chop the tomatoes and as blessed tomato and cucumber into a serving bowl up we've cooked add a spoonful up the fajita mix into a wrap with the salsa and after yogurt throughout the rack and served with side salad this recipe makes to avalanche portions done they have just taken but he says and delicious introduction to Mexican flavors blue that

[Bottom panel: detector scores plotted against Video Position; Step 2 is marked.]

Figure 2: Examples from a Chicken Fajitas recipe at https://www.youtube.com/watch?v=mGpvZE3udQ4 (figure best viewed in color). Top: Alignment between (left) recipe steps and (right) automatic speech transcript. Tokens from the ASR are allowed to be classified as background steps (see e.g., the uncolored text at the end). Bottom: Detector scores for two ingredients as a function of position in the video.


Training visual food detectors. We trained a deep convolutional neural network (CNN) classifier (specifically, the 16 layer VGG model from (Simonyan and Zisserman, 2014)) on the Food-101 dataset of (Bossard et al., 2014), using the Caffe open source software (Jia et al., 2014). The Food-101 dataset contains 1000 images for each of 101 different kinds of food. To compensate for the small training set, we pretrained the CNN on the ImageNet dataset (Russakovsky et al., 2014), which has 1.2M images, and then fine-tuned on Food-101. After a few hours of fine tuning (using a single GPU), we obtained 79% classification accuracy (assuming all 101 labels are mutually exclusive) on the test set, which is consistent with the state of the art results.^3


^3 In particular, the website https://www.metamind.io/vision/food (accessed on 2/25/15) claims they also got 79% on this dataset. This is much better than the 56.4% for a CNN reported in (Bossard et al., 2014). We believe the main reason for the improved performance is the use of pre-training on ImageNet.


We then trained our model on an internal, proprietary dataset of 220 million images harvested from Google Images and Flickr. About 20% of these images contain food; the rest are used to train the background class. In this set, there are 2809 classes of food, including 1005 raw ingredients, such as avocado or beef, and 1804 dishes, such as ratatouille or cheeseburger with bacon. We use the model trained on this much larger dataset in the current paper, due to its increased coverage. (Unfortunately, we cannot report quantitative results, since the dataset is very noisy (sometimes half of the labels are wrong), so we have no ground truth. Nevertheless, qualitative behavior is reasonable, and the model does well on Food-101, as we discussed above.)

Visual refinement pipeline. For storage and time 
efficiency, we downsample each video temporally to 
5 frames per second and each frame to 224 x 224 
before applying the CNN. Running the food detector 
on each video then produces a vector of scores (one 
entry for each of 2809 classes) per timeframe. 

There is not a perfect map from the names of ingredients to the names of the detector outputs. For example, an omelette recipe may say “egg”, but there are two kinds of visual detectors, one for “scrambled egg” and one for “raw egg”. We therefore decided to define the match score between an ingredient and a frame by taking the maximum score for that frame over all detectors whose names matched any of the ingredient tokens (after lemmatization and stop word filtering).

Finally, the match score of a video segment to an object is computed by taking the average score of all frames within that segment. By then scoring and maximizing over all translations of the candidate segment (of up to three seconds away), we produce a final “refined” segment.
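The match score and the translation search can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, segments are expressed as half-open frame ranges at the downsampled rate of 5 frames per second, `frame_scores[t][d]` is assumed to hold the score of detector `d` on frame `t`, and shifts that would run past either end of the video are simply skipped.

```python
def segment_score(frame_scores, start, end, detector_ids):
    """Average, over frames [start, end), of the best matching detector score."""
    frames = frame_scores[start:end]
    if not frames or not detector_ids:
        return float("-inf")
    return sum(max(f[d] for d in detector_ids) for f in frames) / len(frames)


def refine_segment(frame_scores, start, end, detector_ids, fps=5, max_shift_s=3):
    """Shift [start, end) by up to max_shift_s seconds to maximize the match score."""
    max_shift = int(max_shift_s * fps)
    length = end - start
    best_score = segment_score(frame_scores, start, end, detector_ids)
    best_start = start
    for shift in range(-max_shift, max_shift + 1):
        s = start + shift
        if s < 0 or s + length > len(frame_scores):
            continue  # shifted window would fall outside the video
        score = segment_score(frame_scores, s, s + length, detector_ids)
        if score > best_score:
            best_score, best_start = score, s
    return best_start, best_start + length
```

Here `detector_ids` would be the detectors whose names match any token of the ingredient, so an ingredient such as “egg” can be matched by either a “raw egg” or a “scrambled egg” detector.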

3.5 Quantifying confidence via vision and affordances

The output of the keyword spotting and/or HMM systems is an (action, object) label assigned to certain video clips. In order to estimate how much confidence we have in that label (so that we can trade off precision and recall), we use a linear combination of two quantities: (1) the final match score produced by the visual refinement pipeline, which measures the visibility of the object in the given video segment, and (2) an affordance probability, measuring the probability that o appears as a direct object of a.

The affordance model allows us to, for example, prioritize a segment labeled as (peel, garlic) over a segment labeled as (peel, sugar). The probabilities P(object = o | action = a) are estimated by first forming an inverse document frequency matrix capturing action/object co-occurrences (treating actions as documents). To generalize across actions and objects we form a low-rank approximation to this IDF matrix using a singular value decomposition and set affordance probabilities to be proportional to exponentiated entries of the resulting matrix. Figure 3 visualizes these affordance probabilities for a selected subset of frequently used action/object pairs.

4 Evaluation and applications 

In this section, we experimentally evaluate how well 
our methods work. We then briefly demonstrate 
some prototype applications. 

4.1 Evaluating the clip database 

One of the main outcomes of our process is a set of video clips, each of which is labeled with a verb (action) and a noun (object). We generated 3 such labeled corpora, using 3 different methods: keyword spotting (“KW”), the hybrid HMM + keyword spotting (“Hybrid”), and the hybrid system with visual food detector (“visual refinement”). The total number of clips produced by each method is very similar, approximately 1.4 million. The coverage of the clips is approximately 260k unique (action, noun phrase) pairs.

Figure 3: Visualization of affordance model. Entries (a, o) are colored according to P(object = o | action = a).

Figure 4: Clip quality, as assessed by Mechanical Turk experiments on 900 trials. Figure best viewed in color; see text for details.

Figure 5: Average clip quality (precision) after filtering out low confidence clips versus # clips retained (recall).

Figure 6: Histogram of human ratings comparing recipe steps against ASR descriptions of a video clip. “2” indicates a strong preference for the recipe step; “-2” a strong preference for the transcript. See text for details.

To evaluate the quality of these methods, we created a random subset of 900 clips from each corpus using stratified sampling. That is, we picked an action uniformly at random, and then picked a corresponding object for that action from its support set uniformly at random, and finally picked a clip with that (action, object) label uniformly at random from the clip corpora produced in Section 3; this ensures the test set is not dominated by frequent actions or objects.
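The three-stage sampling scheme can be sketched as follows. The function name and the (action, object, clip_id) tuple format are assumptions; the sketch also samples with replacement, which the text does not specify.

```python
import random
from collections import defaultdict


def stratified_sample(clips, n, seed=0):
    """Draw n clips by sampling action, then object, then clip, each uniformly.

    clips: list of (action, obj, clip_id) tuples.
    """
    rng = random.Random(seed)
    by_action = defaultdict(lambda: defaultdict(list))
    for action, obj, clip_id in clips:
        by_action[action][obj].append(clip_id)
    actions = sorted(by_action)
    sample = []
    for _ in range(n):
        action = rng.choice(actions)                # uniform over actions
        obj = rng.choice(sorted(by_action[action])) # uniform over that action's objects
        clip_id = rng.choice(by_action[action][obj])
        sample.append((action, obj, clip_id))
    return sample
```

Because the action is drawn first, a rare action with a single clip is sampled about as often as a very frequent action, which is the property the evaluation relies on.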

We then performed a Mechanical Turk experiment on each test set. Each clip was shown to 3 raters, and each rater was asked the question “How well does this clip show the given action/object?”. Raters then had to answer on a 3-point scale: 0 means “not at all”, 1 means “somewhat”, and 2 means “very well”.

The results are shown in Figure 4. We see that the quality of the hybrid method is significantly better than the baseline keyword spotting method, for both actions and objects.^4 While a manually curated speech transcript indeed yields better results (see the bars labeled ‘manual’), we observe that automatically generated transcripts allow us to perform almost as well, especially using our alignment model with visual refinement.

^4 Inter-rater agreement, measured via Fleiss’s kappa by aggregating across all judgment tasks, is .41, which is statistically significant at a p < .05 level.

Comparing accuracy on actions against that on objects in the same figure, we see that keyword spotting is far more accurate for actions than it is for objects (by over 30%). This disparity is not surprising since keyword spotting searches only for action keywords and relies on a rough heuristic to recover objects. We also see that using alignment (which extracts the object from the “clean” recipe text) and visual refinement (which is trained explicitly to detect ingredients) both help to increase the relative accuracy of objects; under the hybrid method, for example, the accuracy for actions is only 8% better than that of objects.

Note that clips from the HMM and hybrid methods varied in length between 2 and 10 seconds (mean 4.2 seconds), while clips from the keyword spotting method were always exactly 8 seconds. Thus clip length is potentially a confounding factor in the evaluation when comparing the hybrid method to the keyword-spotting method; however, if there is a bias to assign higher ratings to longer clips (which are a priori more likely to contain a depiction of a given action than shorter clips), it would benefit the keyword spotting method.

Segment confidence scores (from Section 3.5) can be used to filter out low confidence segments, thus improving the precision of clip retrieval at the cost of recall. Figure 5 visualizes this trade-off as we vary our confidence threshold, showing that indeed, segments with higher confidences tend to have the highest quality as judged by our human raters. Moreover, the top 167,000 segments as ranked by our confidence measure have an average rating exceeding 1.75.
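The trade-off can be computed from rated clips with a short routine like the one below. It is a sketch with a hypothetical function name and input format: each clip is a (confidence, human_rating) pair, and lowering the threshold corresponds to moving further along the returned list.

```python
def quality_vs_retained(scored_clips):
    """Mean human rating of the top-i clips by confidence, for every i.

    scored_clips: list of (confidence, human_rating) pairs.
    Returns a list of (num_retained, mean_rating) points.
    """
    ranked = sorted(scored_clips, key=lambda c: c[0], reverse=True)
    curve, total = [], 0.0
    for i, (_, rating) in enumerate(ranked, start=1):
        total += rating
        curve.append((i, total / i))
    return curve
```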

We additionally sought to evaluate how well recipe steps from the recipe body could serve as captions for video clips in comparison to the often noisy ASR transcript. This serves as a rough proxy for evaluating the quality of the alignment model, and also demonstrates a potential application of our method for “cleaning up” noisy ASR captions into complete grammatical sentences. To that end, we randomly selected 200 clips from our corpus that have both an associated action keyword from the transcript and an aligned recipe step selected by the HMM alignment model. For each clip, three raters on Mechanical Turk were shown the clip, the text from the recipe step, and a fragment of the ASR transcript (the keyword, plus 5 tokens to the left and right of the keyword). Raters then indicated which description they preferred: 2 indicates a strong preference for the recipe step, 1 a weak preference, 0 indifference, -1 a weak preference for the transcript fragment, and -2 a strong preference. Results are shown in Figure 6. Excluding raters who indicated indifference, 67% of raters preferred the recipe step as the clip’s description.

A potential confound for using this analysis as a proxy for the quality of the alignment model is that the ASR transcript is generally an ungrammatical sentence fragment as opposed to the grammatical recipe steps, which is likely to reduce the raters’ approval of ASR captions in the case when both accurately describe the scene. However, if users still on average prefer an ASR sentence fragment which describes the clip correctly versus a full recipe step which is unrelated to the scene, then this experiment still provides evidence of the quality of the alignment model.

4.2 Automatically illustrating a recipe 

One useful byproduct of our alignment method is that each recipe step is associated with a segment of the corresponding video.^5 We use a standard keyframe selection algorithm to pick the best frame from each segment. We can then associate this frame with the corresponding recipe step, thus automatically illustrating the recipe steps. An illustration of this process is shown in Figure 7.

4.3 Search within a video 

Another application which our methods enable is search within a video. For example, if a user would like to find a clip illustrating how to knead dough, we can simply search our corpus of labeled clips, and return a list of matches (ranked by confidence). Since each clip has a corresponding “provenance”, we can return the results to the user as a set of videos in which we have automatically “fast forwarded” to the relevant section of the video (see Figure 8 for an example). This stands in contrast to standard video search on Youtube, which returns the whole video, but does not (in general) indicate where within the video the user’s search query occurs.

^5 The HMM may assign multiple non-consecutive regions of the video to the same recipe step (since the background state can turn on and off). In such cases, we just take the “convex hull” of the regions as the interval which corresponds to that step. It is also possible for the HMM not to assign a given step to any interval of the video.

Figure 8: Searching for “knead dough”. Note that the videos have automatically been advanced to the relevant frame.


5 Related work 


There are several pieces of related work. (Yu et al., 2014) performs keyword spotting in the speech transcript in order to label clips extracted from instructional videos. However, our hybrid approach performs better; the gain is especially significant on automatically generated speech transcripts, as shown in Figure 4.

The idea of using an HMM to align instructional steps to a video was also explored in (Naim et al., 2014). However, their conditional model has to generate images, whereas ours just has to generate ASR words, which is an easier task. Furthermore, they only consider 6 videos collected in a controlled lab setting, whereas we consider over 180k videos collected “in the wild”.

Another paper that uses HMMs to process recipe text is (Druck and Pang, 2012). They use the HMM to align the steps of a recipe to the comments made by users in an online forum, whereas we align the steps of a recipe to the speech transcript. Also, we use video information, which was not considered in this earlier work.

(Joshi et al., 2006) describes a system to automatically illustrate a text document; however, they only generate one image, not a sequence, and their techniques are very different.

[Example keyframes paired with recipe text: “stir the mashed avocados into the other mixture for a homemade guacamole recipe that's perfect for any occasion!” and “You could even add some cayenne, jalapenos, or ancho chili for even more kick to add to your Mexican food night!”]

Figure 7: Automatically illustrating a Guacamole recipe from https://www.youtube.com/watch?v=H7Ne3s2021U

There is also a large body of other work on connecting language and vision; we only have space to briefly mention a few key papers. (Rohrbach et al., 2012b) describes the MPII Cooking Composite Activities dataset, which consists of 212 videos collected in the lab of people performing various cooking activities. (This extends the dataset described in their earlier work, (Rohrbach et al., 2012a).) They also describe a method to recognize objects and actions using standard vision features. However, they do not leverage the speech signal, and their dataset is significantly smaller than ours.

(Guadarrama et al., 2013) describes a method for generating subject-verb-object triples given a short video clip, using standard object and action detectors. The technique was extended in (Thomason et al., 2014) to also predict the location/place. Furthermore, they use a linear-chain CRF to combine the visual scores with a simple (s,v,o,p) language model (similar to our affordance model). They applied their technique to the dataset in (Chen and Dolan, 2011), which consists of 2000 short video clips, each described with 1-3 sentences. By contrast, we focus on aligning instructional text to the video, and our corpus is significantly larger.

(Yu and Siskind, 2013) describes a technique for estimating the compatibility between a video clip and a sentence, based on relative motion of the objects (which are tracked using HMMs). Their method is tested on 159 video clips, created under carefully controlled conditions. By contrast, we focus on aligning instructional text to the video, and our corpus is significantly larger.


6 Discussion and future work 

In this paper, we have presented a novel method for aligning instructional text to videos, leveraging both speech recognition and visual object detection. We have used this to align 180k recipe-video pairs, from which we have extracted a corpus of 1.4M labeled video clips - a small but crucial step toward building a multimodal procedural knowledge base. In the future, we hope to use this labeled corpus to train visual action detectors, which can then be combined with the existing visual object detectors to interpret novel videos. Additionally, we believe that combining visual and linguistic cues may help overcome longstanding challenges to language understanding, such as anaphora resolution and word sense disambiguation.